Advanced Analytics with R (UG 21-24)

Resources and Introduction

Ayush Patel

Before we start

Please install the following packages if they are not already installed:

install.packages("tidymodels")

install.packages("ISLR2")

install.packages("ISLR")
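If some of these packages are already on your machine, a small helper can skip them. The sketch below uses base R only; `missing_packages()` is a hypothetical helper name introduced here for illustration, not a function from any package.

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

course_pkgs <- c("tidymodels", "ISLR2", "ISLR")
to_install  <- missing_packages(course_pkgs)
to_install   # character vector of packages still to be installed (possibly empty)

# To install only what is missing, run:
# if (length(to_install) > 0) install.packages(to_install)
```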



Access the lecture slides at bit.ly/aar-ug

Warrior's armor(gusoku)
Source: Armor (Gusoku)

Hello

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics

I am an RStudio (Posit) certified tidyverse instructor.

I am a researcher at the Oxford Poverty and Human Development Initiative (OPHI) at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

What will you learn?

Statistical Learning Techniques

  • What are Statistical Learning Techniques?

  • When to apply a given technique?

  • How to apply a given technique using R?

  • Ways to evaluate the performance of a technique (how well does it serve your purpose?).

What you will not learn

  • All statistical techniques that exist.

  • All the mathematics behind a statistical technique.

Resources

There is no single fixed textbook for this course.

But here are some resources I will be using to teach:

Pre-requisite

  • Data Wrangling.

  • Data Visualization.

  • Exploratory Data Analysis.

  • Fundamental Stats - random variables, summary statistics, probability distributions, etc.

Why learn this stuff?

Out of curiosity, and in order to affect outcomes: we are interested in knowing how something works, why something happens, and what will happen.

Ways to think about Data Science methods and models

What is Statistical Learning? - ISLR

“Statistical learning refers to a vast set of tools for understanding data.”

“Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.”

“With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.”

Types of Models - TMWR

“While this list is not exhaustive, most models fall into at least one of these categories:”

    1. Descriptive Models: “The purpose of a descriptive model is to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data.”
    2. Inferential Models: “The goal of an inferential model is to produce a decision for a research question or to explore a specific hypothesis, similar to how statistical tests are used.”
    3. Predictive Models: “Sometimes data are modeled to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data.”

Statistical Learning

Finding Nemo (function f)

In order to predict, estimate or classify a variable of interest using other variables, we try to find out how these other variables provide systematic information about the variable of interest.

\[Y = f(X) + e\]

Y is the variable of interest.

f(X) is the function (Nemo) that provides systematic information about Y.

e is the error term, which is independent of X and has mean zero.

The essence of statistical learning is to estimate f.
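To make the notation concrete, here is a minimal simulation sketch. The linear f, the sample size and the error standard deviation are all assumptions chosen for illustration; with real data f is unknown and only x and y are observed.

```r
set.seed(1)

# Assumed "true" f, chosen only for illustration; in practice f is unknown.
f <- function(x) 2 + 3 * x

n <- 200
x <- runif(n, min = 0, max = 10)
e <- rnorm(n, mean = 0, sd = 2)   # error term: independent of x, mean zero
y <- f(x) + e                     # Y = f(X) + e

# Estimate f with a straight line fitted by least squares.
fit <- lm(y ~ x)
coef(fit)   # estimated intercept and slope (the assumed true values are 2 and 3)
```

Because we generated the data ourselves, we can compare the estimated coefficients with the values used in f, which is never possible with real data.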

Why Estimate f?

Either we need to predict or estimate some quantity.

Prediction

True representation:

\[Y = f(X) + e\]

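Model: (f is typically unknown; for prediction we are OK treating its estimate as a black box)

\[\hat Y = \hat f(X)\]

Holding \(\hat f\) and X fixed:

\[\begin{aligned} E(Y - \hat Y)^2 &= E[f(X) + e - \hat f(X)]^2 \\ &= [f(X) - \hat f(X)]^2 + Var(e) \end{aligned}\]

The first term is the reducible error; the second term, Var(e), is the irreducible error.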

The goal is to minimize the reducible error.

Var(e) sets an upper bound on the accuracy of your predictions.
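The split into reducible and irreducible error can be checked numerically. The sketch below fixes one input point and uses made-up values for the true f, the prediction and the error standard deviation; none of these numbers come from real data.

```r
set.seed(42)

f_x0    <- 5.0   # assumed true value of f at a fixed input x0
fhat_x0 <- 4.5   # assumed fixed prediction from some fitted model
sigma   <- 1     # assumed standard deviation of e, so Var(e) = 1

# Draw many responses at x0: Y = f(x0) + e
y <- f_x0 + rnorm(1e5, mean = 0, sd = sigma)

expected_sq_error <- mean((y - fhat_x0)^2)   # simulated E(Y - Y_hat)^2
reducible   <- (f_x0 - fhat_x0)^2            # [f(X) - f_hat(X)]^2
irreducible <- sigma^2                       # Var(e)

c(simulated = expected_sq_error, decomposition = reducible + irreducible)
```

A better model can shrink `reducible` toward zero, but no model can push the expected squared error below `irreducible`.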

Why Estimate f?

Either we need to predict or estimate some quantity.

Inference

Here, instead of worrying about what Y will be, we are more concerned with how Y and X are related. We can no longer ignore our lack of knowledge about the form of f.

We are increasingly concerned about the relationship of response variable and each independent variable.
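For inference, the fitted coefficients and their uncertainty are the object of interest rather than the predictions. A minimal sketch with simulated data follows; the data-generating process and the variable names `x1` and `x2` are assumptions for illustration.

```r
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
# Assumed data-generating process: x1 raises y, x2 lowers it.
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 1)

fit <- lm(y ~ x1 + x2)

summary(fit)$coefficients    # estimate, standard error, t value and p-value per term
confint(fit, level = 0.95)   # 95% confidence interval for each coefficient
```

Each row describes the relationship between the response and one predictor, holding the other predictor fixed.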

How to Estimate f?

Parametric

  • Assume a functional form for f

\[f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n\]

or

\[f(X) = \beta_0 + \beta_1 X_1^2 + \beta_2 X_2^2 + ... + \beta_n X_n^2 \]

  • Use training data to fit the model. Example: the least squares method can be used to fit the first equation.

  • Now the goal is to estimate n+1 coefficients (parameters) instead of an arbitrary n-dimensional function.

  • Disadvantage: the functional form we choose might not match the true form of f. If the two are very different, the quality of the estimates is poor.
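The parametric steps and their main disadvantage can be sketched as follows. The quadratic true f is an assumption made so that one of the two candidate forms is deliberately wrong.

```r
set.seed(123)
n <- 300
x <- runif(n, -3, 3)
y <- 1 + 2 * x^2 + rnorm(n, sd = 1)   # assumed true f is quadratic

# Two parametric choices, both fitted by least squares.
fit_linear    <- lm(y ~ x)        # assumes f(X) = b0 + b1 * X
fit_quadratic <- lm(y ~ I(x^2))   # assumes f(X) = b0 + b1 * X^2

# With one predictor, each model has only two coefficients to estimate.
coef(fit_linear)
coef(fit_quadratic)

# Training MSE: the misspecified linear form fits far worse.
mse <- function(fit) mean(residuals(fit)^2)
c(linear = mse(fit_linear), quadratic = mse(fit_quadratic))
```

Both fits are equally easy to compute; what differs is whether the assumed form is close to the truth.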

How to Estimate f?

Non-Parametric

  • No assumption about the form of the function.
  • Instead, these methods seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.
  • Since there is no assumption about functional form, it is more flexible compared to parametric approaches.
  • However, since no assumption is made about the form of the function, a large number of observations is required for non-parametric methods to produce an accurate estimate of f.
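As one illustration of a non-parametric method, the sketch below uses local regression (`loess()` from base R's stats package) and compares it with a straight line. The sine-shaped true f and the `span` value are assumptions; `span` controls how smooth or wiggly the estimate is.

```r
set.seed(2024)
n <- 200
f <- function(x) sin(x)                  # assumed true f, unknown in practice
x <- sort(runif(n, 0, 2 * pi))
y <- f(x) + rnorm(n, sd = 0.3)

fit_linear <- lm(y ~ x)                  # parametric: assumes a straight line
fit_loess  <- loess(y ~ x, span = 0.3)   # non-parametric: local regression

# How far is each estimate from the true f at the observed x values?
err_linear <- mean((predict(fit_linear) - f(x))^2)
err_loess  <- mean((predict(fit_loess)  - f(x))^2)
c(linear = err_linear, loess = err_loess)
```

The local fit never assumed a sine curve, yet it can follow one; the price is that it needs enough observations in every neighbourhood of x.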

Assessing Model Accuracy

Does it fit?

For quantitative response variables

\[ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat y_i)^2 \]

\[\hat y_i = \hat f(x_i)\]

Should you care about training MSE or test MSE?

Should you use training MSE if you don’t have test data?
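One way to explore these questions is to compare training and test MSE as model flexibility grows. The sketch below uses simulated data and polynomial fits; the true f, the sample sizes and the chosen degrees are all assumptions for illustration.

```r
set.seed(99)
f <- function(x) sin(2 * x)   # assumed true f
make_data <- function(n) {
  x <- runif(n, 0, 3)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.4))
}
train <- make_data(60)     # data used to fit the models
test  <- make_data(1000)   # fresh data the models have never seen

mse <- function(y, y_hat) mean((y - y_hat)^2)

degrees <- c(1, 3, 10)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree    = d,
    train_mse = mse(train$y, predict(fit, newdata = train)),
    test_mse  = mse(test$y,  predict(fit, newdata = test)))
}))
results
```

Training MSE can only go down as the polynomial degree increases, so it cannot tell you when a model has started to overfit; test MSE is the quantity to care about.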

Bias-Variance Trade-off

For quantitative response variables

\[E(y_0 - \hat f(x_0))^2 = Var(\hat f(x_0)) + [Bias(\hat f(x_0))]^2 + Var(e)\]
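The three terms can be approximated by simulation: refit a model on many fresh training sets and look at how its prediction at one test point x0 behaves. This is a sketch under assumed values (sine-shaped f, fixed training inputs, polynomial degrees 1 and 5); `simulate_once()` and `decompose()` are hypothetical helper names.

```r
set.seed(2023)
f     <- function(x) sin(x)                # assumed true f
sigma <- 0.3                               # assumed sd of e
x     <- seq(0, 2 * pi, length.out = 50)   # fixed training inputs
x0    <- pi / 2                            # test point
reps  <- 1000                              # number of simulated training sets

simulate_once <- function(degree) {
  y   <- f(x) + rnorm(length(x), sd = sigma)   # fresh training responses
  fit <- lm(y ~ poly(x, degree))
  f_hat_x0 <- predict(fit, newdata = data.frame(x = x0))
  y0 <- f(x0) + rnorm(1, sd = sigma)           # fresh test response at x0
  c(f_hat = unname(f_hat_x0), sq_err = unname((y0 - f_hat_x0)^2))
}

decompose <- function(degree) {
  sims  <- replicate(reps, simulate_once(degree))
  f_hat <- sims["f_hat", ]
  c(variance = var(f_hat),                   # Var(f_hat(x0))
    bias_sq  = (mean(f_hat) - f(x0))^2,      # [Bias(f_hat(x0))]^2
    var_e    = sigma^2,                      # Var(e)
    test_mse = mean(sims["sq_err", ]))       # simulated E(y0 - f_hat(x0))^2
}

rigid    <- decompose(1)   # straight line
flexible <- decompose(5)   # degree-5 polynomial
rbind(rigid, flexible)
```

In each row, `variance + bias_sq + var_e` should be close to `test_mse`. The rigid model pays mostly in bias, the flexible one mostly in variance.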

Does it fit?

For qualitative response variables

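\[\text{Error rate} = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat y_i)\]

Here \(I(y_i \neq \hat y_i)\) is an indicator that equals 1 if observation i is misclassified and 0 otherwise.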
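In R the error rate is the mean of a logical comparison. The labels below are made up purely for illustration.

```r
# Made-up true labels and predicted labels, purely for illustration.
y     <- c("yes", "no",  "yes", "yes", "no", "no", "yes", "no")
y_hat <- c("yes", "yes", "yes", "no",  "no", "no", "yes", "no")

# y != y_hat is TRUE where a prediction is wrong;
# the mean of a logical vector is the proportion of TRUE values.
error_rate <- mean(y != y_hat)
error_rate

# Cross-tabulate truth against prediction to see where the errors fall.
table(truth = y, predicted = y_hat)
```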

Reading and Acknowledgements

Read Chapters 1 and 2 of An Introduction to Statistical Learning with Applications in R (ISLR).

Thank you